
Conversation

@oneby-wang
Contributor

@oneby-wang oneby-wang commented Dec 21, 2025

Fixes #25092

Motivation

The ManagedCursorTest.asyncMarkDeleteBlocking() test became very flaky after PR #25087; this PR fixes it.

Modifications

  1. While debugging locally, I saw many logs like this:
org.apache.bookkeeper.mledger.ManagedLedgerException: org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$MarkDeletingMarkedPosition: Mark deleting an already mark-deleted position. Current mark-delete: 13:9 -- attempted mark delete: 13:8
Caused by: org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$MarkDeletingMarkedPosition: Mark deleting an already mark-deleted position. Current mark-delete: 13:9 -- attempted mark delete: 13:8
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setAcknowledgedPosition(ManagedCursorImpl.java:2076)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.asyncMarkDelete(ManagedCursorImpl.java:2212)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.maybeUpdateCursorBeforeTrimmingConsumedLedger(ManagedLedgerImpl.java:2736)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl$14.operationComplete(ManagedLedgerImpl.java:1768)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl$14.operationComplete(ManagedLedgerImpl.java:1747)
	at org.apache.bookkeeper.mledger.impl.MetaStoreImpl.lambda$asyncUpdateLedgerIds$7(MetaStoreImpl.java:187)

So I changed if (!lastAckedPosition.equals(cursor.getMarkDeletedPosition())) to if (lastAckedPosition.compareTo(cursor.getMarkDeletedPosition()) > 0) in ManagedLedgerImpl.
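The intent of the changed guard can be illustrated with a minimal sketch. The Pos record below is a simplified, hypothetical stand-in for the real Position class, only to show why a strict compareTo check is the right condition:

```java
// Hypothetical stand-in for a mark-delete position; not the real
// org.apache.bookkeeper.mledger.Position class.
record Pos(long ledgerId, long entryId) implements Comparable<Pos> {
    @Override
    public int compareTo(Pos o) {
        int c = Long.compare(ledgerId, o.ledgerId);
        return c != 0 ? c : Long.compare(entryId, o.entryId);
    }
}

public class MarkDeleteGuard {
    // Old guard: attempts the mark-delete whenever the positions differ,
    // so an *older* position (13:8 vs. current 13:9) is still attempted
    // and fails with MarkDeletingMarkedPosition.
    static boolean oldGuard(Pos lastAcked, Pos markDeleted) {
        return !lastAcked.equals(markDeleted);
    }

    // New guard: only attempt the mark-delete when lastAcked is strictly
    // ahead of the cursor's current mark-delete position.
    static boolean newGuard(Pos lastAcked, Pos markDeleted) {
        return lastAcked.compareTo(markDeleted) > 0;
    }

    public static void main(String[] args) {
        Pos current = new Pos(13, 9); // cursor's current mark-delete position
        Pos stale = new Pos(13, 8);   // older position seen on the trim path
        System.out.println(oldGuard(stale, current)); // true  -> would be attempted
        System.out.println(newGuard(stale, current)); // false -> skipped
    }
}
```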

  2. Fix some typos and add some TODOs (which can be deleted if not needed) in ManagedLedgerImpl.

  3. Add the simplest test case method, asyncMarkDeleteBlockingWithOneShot(); see the method comments. I moved c1.asyncMarkDelete out of the asyncAddEntry#addComplete callback, because the two operations compete for the same ledgerId due to a race condition, which is why the asyncMarkDeleteBlocking() test became more flaky after [fix][broker] Fix cursor position persistence in ledger trimming #25087. Here are some logs from when c1.asyncMarkDelete stays inside the asyncAddEntry#addComplete callback:

2025-12-19T21:58:48,996 - INFO  - [main:ManagedCursorTest@1405] - size: 100, positions: [3:0, 3:1, 3:2, 3:3, 3:4, 3:5, 3:6, 3:7, 3:8, 3:9, 5:0, 5:1, 5:2, 5:3, 5:4, 5:5, 5:6, 5:7, 5:8, 5:9, 7:0, 7:1, 7:2, 7:3, 7:4, 7:5, 7:6, 7:7, 7:8, 7:9, 9:0, 9:1, 9:2, 9:3, 9:4, 9:5, 9:6, 9:7, 9:8, 9:9, 11:0, 11:1, 11:2, 11:3, 11:4, 11:5, 11:6, 11:7, 11:8, 11:9, 13:0, 13:1, 13:2, 13:3, 13:4, 13:5, 13:6, 13:7, 13:8, 13:9, 14:0, 14:1, 14:2, 14:3, 14:4, 14:5, 14:6, 14:7, 14:8, 14:9, 16:0, 16:1, 16:2, 16:3, 16:4, 16:5, 16:6, 16:7, 16:8, 16:9, 18:0, 18:1, 18:2, 18:3, 18:4, 18:5, 18:6, 18:7, 18:8, 18:9, 19:0, 19:1, 19:2, 19:3, 19:4, 19:5, 19:6, 19:7, 19:8, 19:9]

2025-12-19T21:58:48,625 - INFO  - [main:ManagedCursorTest@1405] - size: 100, positions: [3:0, 3:1, 3:2, 3:3, 3:4, 3:5, 3:6, 3:7, 3:8, 3:9, 5:0, 5:1, 5:2, 5:3, 5:4, 5:5, 5:6, 5:7, 5:8, 5:9, 7:0, 7:1, 7:2, 7:3, 7:4, 7:5, 7:6, 7:7, 7:8, 7:9, 8:0, 8:1, 8:2, 8:3, 8:4, 8:5, 8:6, 8:7, 8:8, 8:9, 10:0, 10:1, 10:2, 10:3, 10:4, 10:5, 10:6, 10:7, 10:8, 10:9, 11:0, 11:1, 11:2, 11:3, 11:4, 11:5, 11:6, 11:7, 11:8, 11:9, 13:0, 13:1, 13:2, 13:3, 13:4, 13:5, 13:6, 13:7, 13:8, 13:9, 15:0, 15:1, 15:2, 15:3, 15:4, 15:5, 15:6, 15:7, 15:8, 15:9, 17:0, 17:1, 17:2, 17:3, 17:4, 17:5, 17:6, 17:7, 17:8, 17:9, 19:0, 19:1, 19:2, 19:3, 19:4, 19:5, 19:6, 19:7, 19:8, 19:9]
  4. Add the asyncMarkDeleteBlockingWithMultiShots() test to reproduce the race condition between c1.asyncMarkDelete and asyncAddEntry#addComplete; see the method comments.

  5. Fix the asyncMarkDeleteBlocking() flaky test; see the method comments.
    a. Fix MetadataStoreException$BadVersionException caused by a race condition between ledger rollover and ledger recovery read.
    b. Fix an assertion failure caused by a race condition between c1.asyncMarkDelete and asyncAddEntry#addComplete.

  6. Add more tests; see the PR discussion.
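The restructuring described in item 3 can be sketched as follows. This is a simplified, hypothetical test skeleton: the executor and the Pos record stand in for the real managed-ledger API, and lastPositionAfterAdds is a made-up name for the pattern of waiting for all addComplete callbacks before issuing a single mark-delete from the test thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MarkDeleteOutsideCallback {
    record Pos(long ledgerId, long entryId) {}

    // Run all simulated asyncAddEntry calls, wait for every addComplete,
    // and only then pick the position to mark-delete on the caller thread,
    // so the mark-delete cannot interleave with addComplete on the same ledger.
    static Pos lastPositionAfterAdds(int entries) {
        ExecutorService bk = Executors.newSingleThreadExecutor();
        try {
            List<CompletableFuture<Pos>> adds = new ArrayList<>();
            for (int i = 0; i < entries; i++) {
                long entryId = i;
                // hypothetical stand-in for asyncAddEntry#addComplete
                adds.add(CompletableFuture.supplyAsync(() -> new Pos(3, entryId), bk));
            }
            CompletableFuture.allOf(adds.toArray(CompletableFuture[]::new)).join();
            // the mark-delete would happen here, after all adds completed
            return adds.get(entries - 1).join();
        } finally {
            bk.shutdown();
        }
    }

    public static void main(String[] args) {
        Pos last = lastPositionAfterAdds(10);
        System.out.println("mark-delete " + last.ledgerId() + ":" + last.entryId());
    }
}
```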

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: oneby-wang#16

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Dec 21, 2025
@oneby-wang oneby-wang force-pushed the ManagedCursorTest_asyncMarkDeleteBlocking_flaky_test branch from 2374c98 to 12c1124 on December 22, 2025 08:32
Member

@lhotari lhotari left a comment


Check the comments about preferring AssertJ

@lhotari
Member

lhotari commented Dec 23, 2025

@oneby-wang There seem to be a few more locations where the same flakiness pattern occurs. Please check at least cursorPersistenceAsyncMarkDeleteSameThread. testLazyCursorLedgerCreationForSubscriptionCreation appeared flaky too, and it would be worth checking whether it's the same problem.
Please also check NonDurableCursorTest.asyncMarkDeleteBlocking.

@oneby-wang
Contributor Author

Got the two comments above.

@oneby-wang
Contributor Author

@lhotari Maybe we should do an asyncMarkDelete when we reach this condition, since this is the root cause of the inconsistency.

if (ledgerInfo != null && ledgerInfo.getLedgerId() != currentLedger.getId()
        && ledgerInfo.getEntries() == slowestReaderPosition.getEntryId() + 1) {
    slowestReaderLedgerId = slowestReaderPosition.getLedgerId() + 1;

@lhotari
Member

lhotari commented Dec 23, 2025

@lhotari Maybe we should do an asyncMarkDelete when we reach this condition, since this is the root cause of the inconsistency.

if (ledgerInfo != null && ledgerInfo.getLedgerId() != currentLedger.getId()
        && ledgerInfo.getEntries() == slowestReaderPosition.getEntryId() + 1) {
    slowestReaderLedgerId = slowestReaderPosition.getLedgerId() + 1;

@oneby-wang I guess that could make sense; however, the problem is that there could be multiple cursors at the slowest position. ManagedCursorContainerImpl/ManagedCursorContainer doesn't have a sorted data structure that could be iterated. ActiveManagedCursorContainer has the method getNumberOfCursorsAtSamePositionOrBefore, which solves this detail. It's currently used for activeCursors in ManagedLedgerImpl when cacheEvictionByExpectedReadCount is enabled, but it would be possible to use it for other use cases as well. The performance of ActiveManagedCursorContainerImpl is also better than ManagedCursorImpl for the ordinary use cases. There are micro benchmarks in the microbench module.
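The "multiple cursors at the slowest position" concern can be illustrated with a small standalone sketch (plain Java, not the real container classes; the Cursor record and cursorsAtOrBefore helper are made up for illustration):

```java
import java.util.List;

// Illustration only: with several cursors parked at the same slowest
// position, advancing just one of them is not enough before trimming.
// The real ActiveManagedCursorContainer exposes
// getNumberOfCursorsAtSamePositionOrBefore for this; here we count by hand.
public class SlowestCursors {
    record Cursor(String name, long ledgerId, long entryId) {}

    static long cursorsAtOrBefore(List<Cursor> cursors, long ledgerId, long entryId) {
        return cursors.stream()
                .filter(c -> c.ledgerId() < ledgerId
                        || (c.ledgerId() == ledgerId && c.entryId() <= entryId))
                .count();
    }

    public static void main(String[] args) {
        List<Cursor> cursors = List.of(
                new Cursor("c1", 12, 9),
                new Cursor("c2", 12, 9),  // same slowest position as c1
                new Cursor("c3", 14, 0));
        // two cursors sit at or before 12:9, so both would need advancing
        System.out.println(cursorsAtOrBefore(cursors, 12, 9)); // 2
    }
}
```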

However, it might be better to handle it separately. Just wondering if it's already handled by the advanceCursorsIfNecessary method:

try {
    advanceCursorsIfNecessary(ledgersToDelete);
} catch (LedgerNotExistException e) {
    log.info("First non deleted Ledger is not found, stop trimming");
    metadataMutex.unlock();
    trimmerMutex.unlock();
    return;
}

@oneby-wang
Contributor Author

oneby-wang commented Dec 23, 2025

Just wondering if it's already handled by the advanceCursorsIfNecessary method:

I read this code. If we mark-delete 12:9 and the next ledger is an empty one (13:-1), it seems this method will only move the mark-delete position to 12:8. If ledger 12 is then trimmed, that causes an inconsistency.

I'm wondering whether a mark-delete position with a -1 entryId is a valid mark-delete position? Some code seems to intentionally avoid it.

// Just ack messages like a consumer. Normally, consumers will not confirm a position that does not exist, so
// find the latest existing position to ack.
Position highestPositionToDelete = calculateLastEntryInLedgerList(ledgersToDelete);

// bypass empty ledgers, find last ledger with Message if possible.
while (lastConfirmedEntry.getEntryId() == -1) {
    Map.Entry<Long, LedgerInfo> formerLedger = ledgers.lowerEntry(lastConfirmedEntry.getLedgerId());
    if (formerLedger != null) {
        LedgerInfo ledgerInfo = formerLedger.getValue();
        lastConfirmedEntry =
                PositionFactory.create(ledgerInfo.getLedgerId(), ledgerInfo.getEntries() - 1);
    } else {
        break;
    }
}

EDIT: if a cursor subscribes to an empty managed ledger, its mark-delete position starts at ledgerId:-1, so that is a valid mark-delete position.
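The bypass loop quoted above walks backward over empty ledgers; the same idea can be shown as a standalone sketch, with a plain TreeMap of ledgerId to entry count standing in for the managed ledger's ledgers map:

```java
import java.util.Map;
import java.util.TreeMap;

// Standalone sketch of the quoted bypass loop: starting from a position with
// entryId == -1 (an empty ledger), walk back to the last ledger that actually
// contains entries. TreeMap stands in for the real ledgers metadata map.
public class BypassEmptyLedgers {
    record Pos(long ledgerId, long entryId) {}

    static Pos bypass(TreeMap<Long, Long> entriesPerLedger, Pos lastConfirmed) {
        while (lastConfirmed.entryId() == -1) {
            Map.Entry<Long, Long> former = entriesPerLedger.lowerEntry(lastConfirmed.ledgerId());
            if (former == null) {
                break; // no earlier ledger, e.g. a cursor on an empty managed ledger
            }
            // last entry of the former ledger is (entries - 1)
            lastConfirmed = new Pos(former.getKey(), former.getValue() - 1);
        }
        return lastConfirmed;
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> entriesPerLedger = new TreeMap<>();
        entriesPerLedger.put(12L, 10L); // ledger 12 holds entries 12:0..12:9
        entriesPerLedger.put(13L, 0L);  // ledger 13 is empty
        Pos result = bypass(entriesPerLedger, new Pos(13, -1));
        System.out.println(result.ledgerId() + ":" + result.entryId()); // 12:9
    }
}
```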

@lhotari
Member

lhotari commented Dec 23, 2025

There seems to be a few more locations where the same flakiness pattern occurs. please check at least cursorPersistenceAsyncMarkDeleteSameThread. testLazyCursorLedgerCreationForSubscriptionCreation appeared flaky too and it would be worth checking whether it's the same problem.
Please also check NonDurableCursorTest.asyncMarkDeleteBlocking.

@oneby-wang Please address this so that we can get at least some improvements to the flakiness. The NonDurableCursorTest.asyncMarkDeleteBlocking method is really flaky too. I'd like to merge this PR very soon, even though we haven't addressed all the details that have come up during the review.

@oneby-wang
Contributor Author

@lhotari Working on it.

@lhotari
Member

lhotari commented Dec 23, 2025

I read this code. If we mark-delete 12:9 and the next ledger is an empty one (13:-1), it seems this method will only move the mark-delete position to 12:8. If ledger 12 is then trimmed, that causes an inconsistency.

I'm wondering whether a mark-delete position with a -1 entryId is a valid mark-delete position? Some code seems to intentionally avoid it.

Good observations. It seems there are a lot of code locations where it's hard to determine the correct behavior.
There seems to be some inconsistency in how things are handled across different parts of the ManagedLedger implementation. I guess we can focus on addressing the test flakiness.

@oneby-wang
Contributor Author

I think maybe we can fix the cursor position and ledger inconsistency via the advanceCursorsIfNecessary method.

Member

@lhotari lhotari left a comment


LGTM. However, it seems that BadVersionException could occur in all cases where the ledger is opened by a different instance without closing the previous one. I think a retry would be needed to address those cases. We could handle that in a separate PR, since this PR will already significantly improve the current CI flakiness.

Thanks for the great work, @oneby-wang

@oneby-wang oneby-wang changed the title [fix][test] Fix ManagedCursorTest.asyncMarkDeleteBlocking() flaky test [fix][test] Fix ManagedCursorTest and NonDurableCursorTest flaky tests Dec 23, 2025
@oneby-wang
Contributor Author

I’ll revisit the BadVersionExceptions over the next few days and submit a PR to address them. Thanks for reviewing and for your great suggestions. @lhotari

@oneby-wang
Contributor Author

however it seems that BadVersionException could occur in all cases where the ledger is opened with a different instance without closing the previous one.

I think it only happens in the rollover-and-reopen race condition case. If the ledger is not rolling over, the metastore is not updated, so reopening the ledger will succeed in updating the metastore without throwing BadVersionException.
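The rollover/reopen race can be sketched as an optimistic versioned update. This is a minimal, hypothetical stand-in for the metastore's compare-and-set semantics, not the real MetaStore API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal stand-in for a versioned metastore: each successful write bumps
// the version, and a writer holding a stale version fails the way
// BadVersionException does. A rollover that updates the metadata between a
// reopening instance's read and write makes the reopening write fail.
public class VersionedMetaStore {
    static class BadVersionException extends Exception {}

    private final AtomicLong version = new AtomicLong(0);

    long read() {
        return version.get();
    }

    void update(long expectedVersion) throws BadVersionException {
        // compare-and-set: succeed only when the caller saw the latest version
        if (!version.compareAndSet(expectedVersion, expectedVersion + 1)) {
            throw new BadVersionException();
        }
    }

    public static void main(String[] args) throws Exception {
        VersionedMetaStore store = new VersionedMetaStore();
        long reopenView = store.read();  // reopening instance reads version 0
        store.update(store.read());      // rollover on the old instance: 0 -> 1
        try {
            store.update(reopenView);    // stale version 0 -> fails
        } catch (BadVersionException e) {
            System.out.println("BadVersionException on stale update");
        }
    }
}
```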

@lhotari lhotari merged commit 9c5e1c3 into apache:master Dec 23, 2025
101 of 105 checks passed
@lhotari lhotari added this to the 4.2.0 milestone Dec 23, 2025
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.49%. Comparing base (621b41f) to head (a1ef783).
⚠️ Report is 4 commits behind head on master.


@@              Coverage Diff              @@
##             master   #25101       +/-   ##
=============================================
+ Coverage     39.14%   74.49%   +35.35%     
- Complexity    13441    33679    +20238     
=============================================
  Files          1842     1899       +57     
  Lines        145491   149655     +4164     
  Branches      16907    17393      +486     
=============================================
+ Hits          56949   111482    +54533     
+ Misses        80889    29311    -51578     
- Partials       7653     8862     +1209     
Flag       Coverage Δ
inttests   26.60% <ø> (+0.02%) ⬆️
systests   23.02% <ø> (-0.07%) ⬇️
unittests  74.03% <ø> (+38.98%) ⬆️

Flags with carried forward coverage won't be shown.
see 1406 files with indirect coverage changes



Development

Successfully merging this pull request may close these issues.

Flaky-test: ManagedCursorTest.asyncMarkDeleteBlocking
